- Home
- Engineering
- increasing the action gap - new operators for reinforcement learning

prev

next

out of 23

Published on

06-Apr-2017View

57Download

1

Transcript

IncreasingtheActionGap:NewOperatorsforReinforcementLearning2017/03/18@NIPS+2 :: :: D1 ryo.iwaki@ams.eng.osaka-u.ac.jp :: ::31NIPSSafeandEfficientOff-PolicyReinforcementLearningSafe and efficient off-policy reinforcement learningRemi Munosmunos@google.comGoogle DeepMindTom Stepletonstepleton@google.comGoogle DeepMindAnna Harutyunyananna.harutyunyan@vub.ac.beVrije Universiteit BrusselMarc G. Bellemarebellemare@google.comGoogle DeepMindAbstractIn this work, we take a fresh look at some old and new algorithms for off-policy,return-based reinforcement learning. Expressing these in a common form, we de-rive a novel algorithm, Retrace(), with three desired properties: (1) low variance;(2) safety, as it safely uses samples collected from any behaviour policy, whateverits degree of off-policyness; and (3) efficiency, as it makes the best use of sam-ples collected from near on-policy behaviour policies. We analyse the contractivenature of the related operator under both off-policy policy evaluation and controlsettings and derive online sample-based algorithms. To our knowledge, this is thefirst return-based off-policy control algorithm converging a.s. to Q without theGLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary,we prove the convergence of Watkins Q(), which was still an open problem. Weillustrate the benefits of Retrace() on a standard suite of Atari 2600 games.One fundamental trade-off in reinforcement learning lies in the definition of the update target: shouldone estimate Monte Carlo returns or bootstrap from an existing Q-function? Return-based meth-ods (where return refers to the sum of discounted rewardsPttrt) offer some advantages overvalue bootstrap methods: they are better behaved when combined with function approximation, andquickly propagate the fruits of exploration (Sutton, 1996). On the other hand, value bootstrap meth-ods are more readily applied to off-policy data, a common use case. In this paper we show thatlearning from returns need not be at cross-purposes with off-policy learning.We start from the recent work of Harutyunyan et al. (2016), who show that naive off-policy policyevaluation, without correcting for the off-policyness of a trajectory, still converges to the desiredQ value function provided the behavior and target policies are not too far apart (the maxi-mum allowed distance depends on the parameter). Their Q() algorithm learns from trajectoriesgenerated by simply by summing discounted off-policy corrected rewards at each time step. Un-fortunately, the assumption that and are close is restrictive, as well as difficult to uphold in thecontrol case, where the target policy is always greedy with respect to the current Q-function. In thatsense this algorithm is not safe: it does not handle the case of arbitrary off-policyness.Alternatively, the Tree-backup (TB) () algorithm (Precup et al., 2000) tolerates arbitrary tar-get/behavior discrepancies by scaling information (here called traces) from future temporal dif-ferences by the product of target policy probabilities. TB() is not efficient in the near on-policycase (similar and ), though, as traces may be cut prematurely, blocking learning from full returns.In this work, we express several off-policy, return-based algorithms in a common form. From thiswe derive an improved algorithm, Retrace(), which is both safe and efficient, enjoying convergenceguarantees for off-policy policy evaluation and more importantly for the control setting.Retrace() can learn from full returns retrieved from past policy data, as in the context of experiencereplay (Lin, 1993), which has returned to favour with advances in deep reinforcement learning (MniharXiv:1606.02647v1 [cs.LG] 8 Jun 2016NIPS 2016 slideshare45difficult and engaging for human players. We used the same networkarchitecture, hyperparameter values (see Extended Data Table 1) andlearning procedure throughouttaking high-dimensional data (210|160colour video at 60 Hz) as inputto demonstrate that our approachrobustly learns successful policies over a variety of games based solelyon sensory inputs with only very minimal prior knowledge (that is, merelythe input data were visual images, and the number of actions availablein each game, but not their correspondences; see Methods). Notably,our method was able to train large neural networks using a reinforce-ment learning signal and stochastic gradient descent in a stable mannerillustrated by the temporal evolution of two indices of learning (theagents average score-per-episode and average predicted Q-values; seeFig. 2 and Supplementary Discussion for details).We compared DQN with the best performing methods from thereinforcement learning literature on the 49 games where results wereavailable12,15. In addition to the learned agents, we also report scores fora professional human games tester playing under controlled conditionsand a policy that selects actions uniformly at random (Extended DataTable 2 and Fig. 3, denoted by 100% (human) and 0% (random) on yaxis; see Methods). Our DQN method outperforms the best existingreinforcement learning methods on 43 of the games without incorpo-rating any of the additional prior knowledge about Atari 2600 gamesused by other approaches (for example, refs 12, 15). Furthermore, ourDQN agent performed at a level that was comparable to that of a pro-fessional human games tester across the set of 49 games, achieving morethan 75% of the human score on more than half of the games (29 games;Convolution Convolution Fully connected Fully connectedNo inputFigure 1 | Schematic illustration of the convolutional neural network. Thedetails of the architecture are explained in the Methods. The input to the neuralnetwork consists of an 84 3 84 3 4 image produced by the preprocessingmap w, followed by three convolutional layers (note: snaking blue linesymbolizes sliding of each filter across input image) and two fully connectedlayers with a single output for each valid action. Each hidden layer is followedby a rectifier nonlinearity (that is, max 0,x ).a bc d 0 200 400 600 800 1,000 1,200 1,400 1,600 1,800 2,000 2,200 0 20 40 60 80 100 120 140 160 180 200Average score per episodeTraining epochs 0 1 2 3 4 5 6 7 8 9 10 11 0 20 40 60 80 100 120 140 160 180 200Average action value (Q)Training epochs 0 1,000 2,000 3,000 4,000 5,000 6,000 0 20 40 60 80 100 120 140 160 180 200Average score per episodeTraining epochs 0 1 2 3 4 5 6 7 8 9 10 0 20 40 60 80 100 120 140 160 180 200Average action value (Q)Training epochsFigure 2 | Training curves tracking the agents average score and averagepredicted action-value. a, Each point is the average score achieved per episodeafter the agent is run with e-greedy policy (e 5 0.05) for 520 k frames on SpaceInvaders. b, Average score achieved per episode for Seaquest. c, Averagepredicted action-value on a held-out set of states on Space Invaders. Each pointon the curve is the average of the action-value Q computed over the held-outset of states. Note that Q-values are scaled due to clipping of rewards (seeMethods). d, Average predicted action-value on Seaquest. See SupplementaryDiscussion for details.RESEARCH LETTER5 3 0 | N A T U R E | V O L 5 1 8 | 2 6 F E B R U A R Y 2 0 1 5Macmillan Publishers Limited. All rights reserved2015AI [Mnih+15] IEEE ROBOTICS & AUTOMATION MAGAZINE MARCH 2016104regression [2]. With a function approximator, the sampled data from the approximated model can be generated by inap-propriate interpolation or extrapolation that improperly up-dates the policy parameters. In addition, if we aggressively derive the analytical gradi-ent of the approximated model to update the poli-cy, the approximated gra-dient might be far from the true gradient of the objective function due to the model approximation error. If we consider using these function approxima-tion methods for high-di-mensional systems like humanoid robots, this problem becomes more serious due to the difficul-ty of approximating high-dimensional dynamics models with a limited amount of data sampled from real systems. On the other hand, if the environ-ment is extremely stochastic, a limited amount of previously acquired data might not be able to capture the real environ-ments property and could lead to inappropriate policy up-dates. However, rigid dynamics models, such as a humanoid robot model, do not usually include large stochasticity. There-fore, our approach is suitable for a real robot learning for high-dimensional systems like humanoid robots.Moreover, applying RL to actual robot control is difficult, since it usually requires many learning trials that cannot be exe-cuted in real environments, and the real systems durability is limited. Previous studies used prior knowledge or properly de-signed initial trajectories to apply RL to a real robot and im-proved the robot controllers parameters [1], [4], [10], [19], [32].We applied our proposed learning method to our human-oid robot [7] (Figure 13) and show that it can accomplish two different movement-learning tasks without any prior knowl-edge for the cart-pole swing-up task or with a very simple nominal trajectory for the basketball-shooting task.The proposed recursive use of previously sampled data to improve policies for real robots would also be useful for other policy search algorithms, such as reward weighted re-gression [11] or information theoretic approaches [12], and it might be interesting to investigate how these combinations work as a future study.ConclusionsIn this article, we proposed reusing the previous experienc-es of a humanoid robot to efficiently improve its task per-formance. We proposed recursively using the off-policy PGPE method to improve the policies and applied our ap-proach to cart-pole swing-up and basketball-shooting tasks. In the former, we introduced a real-virtual hybrid task environment composed of a motion controller and vir-tually simulated cart-pole dynamics. By using the hybrid environment, we can potentially design a wide variety of different task environments. Note that complicated arm movements of the humanoid robot need to be learned for the cart-pole swing-up. Furthermore, by using our pro-posed method, the challenging basketball-shooting task was successfully accomplished.Future work will develop a method based on a transfer learning [28] approach to efficiently reuse the previous expe-riences acquired in different target tasks.AcknowledgmentThis work was supported by MEXT KAKENHI Grant 23120004, MIC-SCOPE, ``Development of BMI Technolo-gies for Clinical Application carried out under SRPBS by AMED, and NEDO. Part of this study was supported by JSPS KAKENHI Grant 26730141. This work was also supported by NSFC 61502339.References[1] A. G. Kupcsik, M. P. Deisenroth, J. Peters, and G. Neumann, Data-effi-cient contextual policy search for robot movement skills, in Proc. National Conf. Artificial Intelligence, 2013.[2] C. E. Rasmussen and C. K. I. Williams Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press, 2006.[3] C. G. Atkeson and S. Schaal, Robot learning from demonstration, in Proc. 14th Int. Conf. Machine Learning, 1997, pp. 1220.[4] C. G. Atkeson and J. Morimoto, Nonparametric representation of poli- cies and value functions: A trajectory-based approach, in Proc. Neural Infor-mation Processing Systems, 2002, pp. 16431650.Efficiently reusing previous experiences is crucial to improve its behavioral policies without actually interacting with real environments.Figure 13. The humanoid robot CB-i [7]. (Photo courtesy of ATR.) [Sugimoto+16]IBM Research / Center for Business OptimizationModeling and Optimization EngineActionsOtherSystem 1System 2System 3 Event ListenerEventNotificationEventNotificationEventNotification< inserts >TP ProfileTaxpayer State ( Current )ModelerOptimizer< input to >State Generator< input to >Case Inventory < reads >< input to >Allocation RulesResource Constraints< input to >< inserts , updates >Business Rules< input to >< generates >Segment Selector Action 1 Cnt Action 2 Cnt Action n Cnt1 C1 C2 V C3 200 50 02 C4 V C1 C7 0 50 250TP ID Feat 1 Feat 2 Feat n123456789 00 5 A 1500122334456 01 0 G 1600122118811 03 9 G 1700Rule Processor< input to >< input to >Recommended Actions< inserts , updates >TP ID Rec. Date Rec. Action Start Date123456789 00 6/21/2006 A1 6/21/2006122334456 01 6/20/2006 A2 6/20/2006122118811 03 5/31/2006 A2Action Handler< input to >New CaseCase ExtractScheduler< starts >< updates >StateTime ExpiredEventNotification< input to >Taxpayer State HistoryStateTP ID State Date Feat 1 Feat 2 Feat n123456789 00 6/1/2006 5 A 1500122334456 01 5/31/2006 0 G 1600122118811 03 4/16/2006 4 R 922122118811 03 4/20/2006 9 G 1700< inserts >Feature Definitions(XML)(XSLT)(XML)(XML)(XSLT)Figure 2: Overall collections system architecture.see the full picture. This also means that, after a recommen-dation is generated for each taxpayer, conflicts may need beconsolidated with these relationships in mind. For example,it may not be desirable to send related taxpayers to differentorganizations. This adjustment is part of what is performedby the module named Rule Processor in the figure. Ad-ditionally, there is a mechanism in place that can be usedto deploy two models in an incumbent/challenger paradigm,until such time as there is sufficient statistical significanceto call a winner. The loser will be replaced by the winneror a new model when available.There are three types of users at NYS DTF. The firsttype is the members of the IT department within DTF, whoare responsible for operating and maintaining the collectionsoptimization system. The second type is the project team,of size ranging from 5 to 15, that represents the users of thecollections optimzation system, who perform validation ofthe rules and output allocations, and oversee the process ofsetting resource constraints and other inputs to the system.The third is the actual collections agents who are directlyaffected by the systems recommended actions. Of these,30 to 40 call center agents receive the recommended actionsvia an application screen, whereas the hundreds of agentsin the specialized organizations get them indirectly via theengines allocations to the respective organizations.The engine itself is a result of multiple years of researchand development, and with all of its components, the rel-evant efforts sum to about 5 million dollars in research in-vestment, although some of the components (modeling andoptimization engines) are generic. Additionally, NYS DTFinvested approximately 4 million dollars for this engagement.The software maintenance planning is underway by IBMGlobal Business Services, as part of the assetization andcommercialization of the engine.5.9 Lift EvaluationA challenge in evaluating the performance of a data-drivenbusiness optimization methodology is that we are typicallyrequired to do so using only historical data, which was col-lected using a different (sampling) policy. Here we employan evaluation method based on bias correction that allowsus to do so, essentially following [1].A useful quantity to estimate is the expected cumulativereward for a new policy , when sampled with respect to theold (or empirical) policy and state distribution , writtenR,() and defined as follows.R,() = Es,[Ea(a|s)[R(s, a)]]We can estimate the above quantity using the sampling pol-icy, with appropriate bias correction (c.f. [19]), in which theobserved reward is multiplied by the ratio between the prob-abilities assigned to the observed action by the respectivepoliciesR,() = Es,a,[(a|s)(a|s)[R(s, a)]]Note, in the above, that (a|s) is known since it is thestochastic policy generated by the constrained reinforcementlearning procedure, but (a|s) needs to be estimated fromthe data, since we do not know the sampling policy explicitly.In our evaluation, we estimate using Laplace estimates ineach of the segments that were output by the segmentedlinear regression algorithm used for function approximationin the learning procedure.Figire 3 plots the expected rewards of the collections poli-cies output by our engine in a typical modeling run, as com-pared to the sampling or DTFs historical policy, as a func-tion of the number of learning iterations. The advantageover the sampling policy is significant, and the models ob-tained over the learning iterations exhibit a steady improve-ment.6. EVALUATION ON OTHER DATA SETSIn order to assess the robustness of our proposed method-ology, we applied our proposed method and evaluated itsperformance on two other real world data sets.6.1 The Data SetsThe first data set we use is based on the well-known dona-tion data set from KDD Cup 1998, which is available fromthe UCI KDD repository.1 This data set contains informa-tion from direct mail promotions soliciting donations, and1http://kdd.ics.uci.edu/ [Abe+10] [Silver+16]IncreasingtheActionGap:NewOperatorsforReinforcementLearning Bellman greedy actiongap Atari2600Increasing the Action Gap:New Operators for Reinforcement LearningMarc G. Bellemare and Georg Ostrovski and Arthur GuezPhilip S. Thomas and Remi MunosGoogle DeepMind{bellemare,ostrovski,aguez,munos}@google.com; philipt@cs.cmu.eduAbstractThis paper introduces new optimality-preserving oper-ators on Q-functions. We first describe an operator fortabular representations, the consistent Bellman opera-tor, which incorporates a notion of local policy con-sistency. We show that this local consistency leads toan increase in the action gap at each state; increasingthis gap, we argue, mitigates the undesirable effectsof approximation and estimation errors on the inducedgreedy policies. This operator can also be applied todiscretized continuous space and time problems, andwe provide empirical results evidencing superior per-formance in this context. Extending the idea of a locallyconsistent operator, we then derive sufficient conditionsfor an operator to preserve optimality, leading to a fam-ily of operators which includes our consistent Bellmanoperator. As corollaries we provide a proof of optimal-ity for Bairds advantage learning algorithm and deriveother gap-increasing operators with interesting proper-ties. We conclude with an empirical study on 60 Atari2600 games illustrating the strong potential of these newoperators.Value-based reinforcement learning is an attractive solu-tion to planning problems in environments with unknown,unstructured dynamics. In its canonical form, value-basedreinforcement learning produces successive refinements ofan initial value function through repeated application of aconvergent operator. In particular, value iteration (Bellman1957) directly computes the value function through the iter-ated evaluation of Bellmans equation, either exactly or fromsamples (e.g. Q-Learning, Watkins 1989).In its simplest form, value iteration begins with an initialvalue function V0 and successively computes Vk+1 := T Vk,where T is the Bellman operator. When the environment dy-namics are unknown, Vkis typically replaced by Qk, thestate-action value function, and T is approximated by anempirical Bellman operator. The fixed point of the Bellmanoperator, Q, is the optimal state-action value function oroptimal Q-function, from which an optimal policy can berecovered.In this paper we argue that the optimal Q-function is in-consistent, in the sense that for any action a which is subop-Now at Carnegie Mellon University.Copyright c 2016, Association for the Advancement of ArtificialIntelligence (www.aaai.org). All rights reserved.timal in state x, Bellmans equation for Q(x, a) describesthe value of a nonstationary policy: upon returning to x, thispolicy selects (x) rather than a. While preserving globalconsistency appears impractical, we propose a simple mod-ification to the Bellman operator which provides us a witha first-order solution to the inconsistency problem. Accord-ingly, we call our new operator the consistent Bellman oper-ator.We show that the consistent Bellman operator gener-ally devalues suboptimal actions but preserves the set ofoptimal policies. As a result, the action gap the valuedifference between optimal and second best actions in-creases. Increasing the action gap is advantageous in thepresence of approximation or estimation error (Farahmand2011), and may be crucial for systems operating at a finetime scale such as video games (Togelius et al. 2009;Bellemare et al. 2013), real-time markets (Jiang and Pow-ell 2015), and robotic platforms (Riedmiller et al. 2009;Hoburg and Tedrake 2009; Deisenroth and Rasmussen 2011;Sutton et al. 2011). In fact, the idea of devaluating subop-timal actions underpins Bairds advantage learning (Baird1999), designed for continuous time control, and occurs nat-urally when considering the discretized solution of contin-uous time and space MDPs (e.g. Munos and Moore 1998;2002), whose limit is the Hamilton-Jacobi-Bellman equation(Kushner and Dupuis 2001). Our empirical results on the bi-cycle domain (Randlov and Alstrom 1998) show a markedincrease in performance from using the consistent Bellmanoperator.In the second half of this paper we derive novel sufficientconditions for an operator to preserve optimality. The rela-tive weakness of these new conditions reveal that it is possi-ble to deviate significantly from the Bellman operator with-out sacrificing optimality: an optimality-preserving operatorneeds not be contractive, nor even guarantee convergence ofthe Q-values for suboptimal actions. While numerous alter-natives to the Bellman operator have been put forward (e.g.recently Azar et al. 2011; Bertsekas and Yu 2012), we be-lieve our work to be the first to propose such a major depar-ture from the canonical fixed-point condition required froman optimality-preserving operator. As proof of the richnessof this new operator family we describe a few practical in-stantiations with unique properties.We use our operators to obtain state-of-the-art empiricalarXiv:1512.04860v1 [cs.AI] 15 Dec 2015AAAI 20167 Q x 2 Xa 2 AP : X A! Pr (X )R : X A! [RMAX, RMAX] : X ! A, 2 Q : X A! R, Q 2 Qx aQ(x, a) , E 1Xt=0tR(xt, at)|xt = x, at = a, 2 [0, 1)8 greedyT Q(x, a) , R(x, a) + EP maxb2AQ(x0, b)[Bertsekas&Tsitsiklis 96]QQQQQQQ = QQQQQTTQ(x, a) = R(x, a) + EP maxb2AQ(x0, b)EP, Ex0P (|x,a)(x) , argmaxa2AQ(x, a) 8x 2 XQ(x, a) = R(x, a) + EP Q(x0,(x0))9 2 greedyresults on the Arcade Learning Environment (Bellemare etal. 2013). We consider the Deep Q-Network (DQN) archi-tecture of Mnih et al. (2015), replacing only its learning rulewith one of our operators. Remarkably, this one-line changeproduces agents that significantly outperform the originalDQN. Our work, we believe, demonstrates the potential im-pact of rethinking the core components of value-based rein-forcement learning.BackgroundWe consider a Markov decision process M :=(X , A, P, R, ) where X is the state space, A is thefinite action space, P is the transition probability kernel,R is the reward function mapping state-action pairs toa bounded subset of R, and 2 [0, 1) is the discountfactor. We denote by Q := QX ,A and V := VX the spaceof bounded real-valued functions over X A and X ,respectively. For Q 2 Q we write V (x) := maxaQ(x, a),and follow this convention for related quantities ( V for Q,V0 for Q0, etc.) whenever convenient and unambiguous. Inthe context of a specific (x, a) 2 X A we further writeEP:= Ex0P ( | x,a) to mean the expectation with respectto P ( | x, a), with the convention that x0 always denotes thenext state random variable.A deterministic policy : X ! A induces a Q-functionQ 2 Q whose Bellman equation isQ(x, a) := R(x, a) + EPQ(x0,(x0)).The state-conditional expected return V (x) :=Q(x,(x)) is the expected discounted total rewardreceived from starting in x and following .The Bellman operator T : Q ! Q is defined pointwiseasT Q(x, a) := R(x, a) + EPmaxb2AQ(x0, b). (1)T is a contraction mapping in supremum norm (Bertsekasand Tsitsiklis 1996) whose unique fixed point is the optimalQ-functionQ(x, a) = R(x, a) + EPmaxb2AQ(x0, b),which induces the optimal policy :(x) := arg maxa2AQ(x, a) 8x 2 X .A Q-function Q 2 Q induces a greedy policy (x) :=arg maxaQ(x, a), with the property that Q = Q if andonly if Q = Q. For x 2 X we call (x) the greedy ac-tion with respect to Q and a 6= (x) a nongreedy action;for these are the usual optimal and suboptimal actions,respectively.We emphasize that while we focus on the Bellman op-erator, our results easily extend to its variations such asSARSA (Rummery and Niranjan 1994), policy evaluation(Sutton 1988), and fitted Q-iteration (Ernst, Geurts, and We-henkel 2005). In particular, our new operators all have asample-based form, i.e., an analogue to the Q-Learning ruleof Watkins (1989).a1a2p = 1, r = 0r = 1Bad Statex1 x2cakeno cakeV = 2(1 + )p = 1/2,Figure 1: A two-state MDP illustrating the non-stationaryaspect of the Bellman operator. Here, p and r indicate tran-sition probabilities and rewards, respectively. In state x1 theagent may either eat cake to receive a reward of 1 and transi-tion to x2 with probability 12 , or abstain for no reward. Statex2 is a low-value absorbing state with > 0.The Consistent Bellman OperatorIt is well known (and implicit in our notation) that the op-timal policy for M is stationary (i.e., time-independent)and deterministic. In looking for , we may therefore re-strict our search to the space of stationary deterministicpolicies. Interestingly, as we now show the Bellman opera-tor on Q is not, in a sense, restricted to .To begin, consider the two-state MDP depicted in Figure1. This MDP abstracts a Faustian situation in which an agentrepeatedly chooses between an immediately rewarding butultimately harmful option (a1), or an unrewarding alterna-tive (a2). For concreteness, we imagine the agent as facedwith an endless supply of delicious cake (with > 0) andcall these the cake and no cake actions.Eating cake can cause a transition to x2, the bad state,whose value is independent of the agents policy:V(x2) := 2(1 + )18 2 .In state x1, however, the Q-values depend on the agents fu-ture behaviour. For a policy 2 , the value of a1 isQ(x1, a1) = 1 + 12V(x1) +12V(x2)(2)= 1 +2V(x1) (1 + ) =2V(x1) .By contrast, the value of a2 isQ(x1, a2) = 0 + V(x1),which is greater than Q(x1, a1) for all . It follows that noteating cake is optimal, and thus V (x1) = Q(x1, a2) = 0.Furthermore, (2) tells us that the value difference betweenoptimal and second best action, or action gap, isQ(x1, a2)Q(x1, a1) = .Notice that Q(x1, a1) = does not describe the value ofany stationary policy. That is, the policy with (x1) = a1has valueV(x1) = +2V(x1) =1 /2 , (3)and in particular this value is lower than Q(x1, a1). In-stead, Q(x1, a1) describes the value of a nonstationary pol-icy which eats cake once, but then subsequently abstains.Q(x1, a1) =2V(x1) V(x2) , 2(1 + ), 8 2 Q(x1, a2) = V(x1)Q(x1, a1) < Q(x1, a2), 8 2 Nocake actiongapV(x1) = Q(x1, a2) = 0Q(x1, a2)Q(x1, a1) = 10 =actiongap MDP greedygreedy11 00ConsistentBellmanoperatorTCQ(x, a) , R(x, a) + EPhI[x 6=x0] maxb2AQ(x0, b) + I[x=x0]Q(x, a)i= T Q(x, a) P (x|x, a)hmaxb2AQ(x, b)Q(x, a)iP (x|x, a)=maxstatevalue actionvalue actiongaps,aAs,aQsV12 AdvantageLearningoperator actiongap =advantage[Baird99] advantage PersistentAdvantageLearningoperator cf.Atari2600,60Hz(Persistent)AdvantagelearningoperatorTPALQ(x, a) , maxnTALQ(x, a), R(x, a) + EPQ(x0, a)o=maxstatevalue actionvalue actiongaps,aAs,aQsVTALQ(x, a) , T Q(x, a) hmaxb2AQ(x, b)Q(x, a)i13 optimalitypreserving: gap-increasing: ::T 0Q(x, a) T Q(x, a)T 0Q(x, a) T Q(x, a) hmaxb2AQ(x, b)Q(x, a)i 2 [0, 1)T 0Q 2 Q, x 2 X , a 2 AQk+1 , T 0QkQ0 2 Q, x 2 XQ(x, a) < V (x) =) lim supk!1Qk(x, a) < V(x)a 2 AQ0 2 Q, x 2 X , a 2 Alim infk!1hVk(x)Qk(x, a)i V (x)Q(x, a)limk!1maxb2AQk(x, a) = V(x)14 actiongap optimalitypreserving: gap-increasing: actiongap actiongap& ::T 0TT 0T 0T V Q15 9 {0} {0} ::1:: & 2' , (!, !, , , , r) Goalr2 Randlovs ModelUsing Randlov and Alstroms bicycle model and envi-ronment, we trained a learner using the SARSA() al-gorithm. We implemented Randlovs balance and go-totasks with PyBrain, a machine learning library for Python[9]. The episodic learning framework is illustrated in Fig-ure 1. To evaluate the execution of the task during learn-ing, we will additionally define a rehearsal sequence. Inone rehearsal sequence, the agent learns by executing 10episodes, after which, the agent performs one episodegreedily without learning. We use the discounted sumof rewards achieved over the performance episode as asuccess metric.We used Randlovs reward function for both tasks. Forthe balance task, a reward of -1 was given when the ab-solute tilt angle exceeded 30 and 0 otherwise. For thego-to task, the following reward was given.if abs(!) > 12r = 1elser = (4 2) 0.0004Figure 2 shows the results from Randlovs balancetask. Our learner was able to balance indefinitely afterapproximately 1500 trials. We observed cases in whichthe agent would fall travel in stable, circular trajectories.This was also observed by Randlov. Figure 3 shows theresulting implementation of Randlovs go-to task. Theagent lfirst learns to balance, and it first reaches thegoal after approximately 4500 trials. Both of the tasksare very sensitive to learning rate annealing. Randlovdoes not mention this, despite the fact that seems to bewell-known in the field [10]. Without annealing, nei-ther task will result in a converging policy because thecurrent learning continually discards all the learning thathad already been done. The go-to task required a muchslower annealing rate than did the balance task becausethe agent first needs to gain experience balancing beforeattempting to navigate toward the goal.Figure 2: (a) Learning curve for the balance task. (b) The tra-jectories of the bicycles wheel contact points in each of severalthousand episodes. Simulation was stopped after 1000 seconds,so trajectories may appear to terminate. The agent also foundseveral stable circular trajectories, as the one highlighted above.Figure 3: (a) Learning curve for the go-to task. (b) The tra-jectories of the bicycles wheel contact points after several thou-sand episodes. Here the goal was placed at a position (x =10m, y = 50 m) relative to the bicycle starting location. Afterabout 4000 episodes (400 rehearsals), the bicycle reaches thegoal with increasing frequency.Now, building on Randlovs implementation, we ex-plore our topics of interesthuman ability to control avirtual bicycle, and shaping of the reinforcement rewardfunction.Figure 1: Learning methodology, in PyBrains terminology.2! dd 2[Randolv &Alstrom 98]16 ::1:: ::rewards driving away from the goal. Instead we use the fol-lowing related reward functionR(x, a) :=8 T Q(x, a) or2. T 0Q(x, a) < T Q(x, a) [V (x)Q(x, a)],and in both cases there exists a Q0 2 Q for whichlimk!1maxa(T 0)kQ0(x, a) 6= V (x).We note that the above remark does not cover the casewhere condition (2) is an equality (i.e., = 1). We leaveas an open problem the existence of a divergent example for = 1.Corollary 1. The consistent Bellman operator TC (8) andconsistent Q-value interpolation Bellman operator TCQVI(10) are optimality-preserving.In fact, it is not hard to show that the consistent Bellmanoperator (7) is a contraction, and thus enjoys even strongerconvergence guarantees than those provided by Theorem 1.Informally, whenever Condition 2 of the theorem is strength-ened to an inequality, we may also expect our operators tobe gap-increasing; this is in fact the case for both of our con-sistent operators.To conclude this section, we describe a few operatorswhich satisfy the conditions of Theorem 1, and are thusoptimality-preserving and gap-increasing. Critically, none ofthese operators are contractions; one of them, the lazy op-erator, also possesses multiple fixed points.Bairds Advantage LearningThe method of advantage learning was proposed by Baird(1999) as a means of increasing the gap between the optimaland suboptimal actions in the context of residual algorithmsapplied to continuous time problems.4 The correspondingoperator isT 0Q(x, a) = K1R(x, a) +t EPV (x0) + (K 1)V (x),3When two or more actions are optimal, we are only guaranteedthat one of them will ultimately be correctly valued. The 1-lazyoperator described below exemplifies this possibility.4Advantage updating, also by Baird, is a popular but differentidea where an agent maintains both V and A := Q V .BellmanoperatorConsistentoperatorFraction of Episodes Ending In FallIterationsConsistentoperator BellmanoperatorIterationsFraction of Episodes Ending At GoalConsistent operatorBellman operatorStartGoalFigure 3: Top. Falling and goal-reaching frequency forgreedy policies derived from value iteration. Bottom. Sam-ple bicycle trajectories after 100, 200, . . . , 1000 iterations.In this coarse-resolution regime, the Bellman operator ini-tially yields policies which circle the goal forever, while theconsistent operator quickly yields successful trajectories.A Family of Convergent OperatorsOne may ask whether it is possible to extend the consistentBellman operator to Q-value approximation schemes whichlack a probabilistic interpretation, such as linear approxi-mation (Sutton 1996), locally weighted regression (Atkeson1991), neural networks (Tesauro 1995), or even information-theoretic methods (Veness et al. 2015). In this section weanswer by the affirmative.The family of operators which we describe here are ap-plicable to arbitrary Q-value approximation schemes. Whilethese operators are in general no longer contractions, theyare gap-increasing, and optimality-preserving when the Q-function is represented exactly. Theorem 1 is our main re-sult; one corollary is a convergence proof for Bairds ad-vantage learning (Baird 1999). Incidentally, our taking theminimum in (10) was in fact no accident, but rather a simpleapplication of this theorem.Theorem 1. Let T be the Bellman operator defined by (1).Let T 0 be an operator with the property that there exists an 2 [0, 1) such that for all Q 2 Q, x 2 X , a 2 A, andletting V (x) := maxbQ(x, b),1. T 0Q(x, a) T Q(x, a), and2. T 0Q(x, a) T Q(x, a) [V (x)Q(x, a)].Then T 0 is both optimality-preserving and gap-increasing.Thus any operator which satisfies the conditions of Theo-rem 1 will eventually yield an optimal greedy policy, assum-ing an exact representation of the Q-function. Condition 2,in particular, states that we may subtract up to (but not in-cluding) maxbQk(x, b) Qk(x, a) from Qk(x, a) at eachiteration. This is exactly the action gap at (x, a), but for Qk,rather than the optimal Q. For a particular x, this implieswe may initially devalue the optimal action a := (x) infavour of the greedy action. But our theorem shows that acannot be undervalued infinitely often, and in fact Qk(x, a)must ultimately reach V (x).3 The proof of this perhaps sur-prising result may be found in the appendix.To the best of our knowledge, Theorem 1 is the firstresult to show the convergence of iterates of dynamicprogramming-like operators without resorting to a contrac-tion argument. Indeed, the conditions of Theorem 1 are par-ticularly weak: we do not require T 0 to be a contraction,nor do we assume the existence of a fixed point (in the Q-function space Q) of T 0. In fact, the conditions laid out inTheorem 1 characterize the set of optimality-preserving op-erators on Q, in the following sense:Remark 1. There exists a single-state MDP M and an op-erator T 0 with either1. T 0Q(x, a) > T Q(x, a) or2. T 0Q(x, a) < T Q(x, a) [V (x)Q(x, a)],and in both cases there exists a Q0 2 Q for whichlimk!1maxa(T 0)kQ0(x, a) 6= V (x).We note that the above remark does not cover the casewhere condition (2) is an equality (i.e., = 1). We leaveas an open problem the existence of a divergent example for = 1.Corollary 1. The consistent Bellman operator TC (8) andconsistent Q-value interpolation Bellman operator TCQVI(10) are optimality-preserving.In fact, it is not hard to show that the consistent Bellmanoperator (7) is a contraction, and thus enjoys even strongerconvergence guarantees than those provided by Theorem 1.Informally, whenever Condition 2 of the theorem is strength-ened to an inequality, we may also expect our operators tobe gap-increasing; this is in fact the case for both of our con-sistent operators.To conclude this section, we describe a few operatorswhich satisfy the conditions of Theorem 1, and are thusoptimality-preserving and gap-increasing. Critically, none ofthese operators are contractions; one of them, the lazy op-erator, also possesses multiple fixed points.Bairds Advantage LearningThe method of advantage learning was proposed by Baird(1999) as a means of increasing the gap between the optimaland suboptimal actions in the context of residual algorithmsapplied to continuous time problems.4 The correspondingoperator isT 0Q(x, a) = K1R(x, a) +t EPV (x0) + (K 1)V (x),3When two or more actions are optimal, we are only guaranteedthat one of them will ultimately be correctly valued. The 1-lazyoperator described below exemplifies this possibility.4Advantage updating, also by Baird, is a popular but differentidea where an agent maintains both V and A := Q V .10-grid18 DeepQNetwork Qlearning+CNN Q FittedQ ::2::Atari2600difficult and engaging for human players. We used the same networkarchitecture, hyperparameter values (see Extended Data Table 1) andlearning procedure throughouttaking high-dimensional data (210|160colour video at 60 Hz) as inputto demonstrate that our approachrobustly learns successful policies over a variety of games based solelyon sensory inputs with only very minimal prior knowledge (that is, merelythe input data were visual images, and the number of actions availablein each game, but not their correspondences; see Methods). Notably,our method was able to train large neural networks using a reinforce-ment learning signal and stochastic gradient descent in a stable mannerillustrated by the temporal evolution of two indices of learning (theagents average score-per-episode and average predicted Q-values; seeFig. 2 and Supplementary Discussion for details).We compared DQN with the best performing methods from thereinforcement learning literature on the 49 games where results wereavailable12,15. In addition to the learned agents, we also report scores fora professional human games tester playing under controlled conditionsand a policy that selects actions uniformly at random (Extended DataTable 2 and Fig. 3, denoted by 100% (human) and 0% (random) on yaxis; see Methods). Our DQN method outperforms the best existingreinforcement learning methods on 43 of the games without incorpo-rating any of the additional prior knowledge about Atari 2600 gamesused by other approaches (for example, refs 12, 15). Furthermore, ourDQN agent performed at a level that was comparable to that of a pro-fessional human games tester across the set of 49 games, achieving morethan 75% of the human score on more than half of the games (29 games;Convolution Convolution Fully connected Fully connectedNo inputFigure 1 | Schematic illustration of the convolutional neural network. Thedetails of the architecture are explained in the Methods. The input to the neuralnetwork consists of an 84 3 84 3 4 image produced by the preprocessingmap w, followed by three convolutional layers (note: snaking blue linesymbolizes sliding of each filter across input image) and two fully connectedlayers with a single output for each valid action. Each hidden layer is followedby a rectifier nonlinearity (that is, max 0,x ).a bc d 0 200 400 600 800 1,000 1,200 1,400 1,600 1,800 2,000 2,200 0 20 40 60 80 100 120 140 160 180 200Average score per episodeTraining epochs 0 1 2 3 4 5 6 7 8 9 10 11 0 20 40 60 80 100 120 140 160 180 200Average action value (Q)Training epochs 0 1,000 2,000 3,000 4,000 5,000 6,000 0 20 40 60 80 100 120 140 160 180 200Average score per episodeTraining epochs 0 1 2 3 4 5 6 7 8 9 10 0 20 40 60 80 100 120 140 160 180 200Average action value (Q)Training epochsFigure 2 | Training curves tracking the agents average score and averagepredicted action-value. a, Each point is the average score achieved per episodeafter the agent is run with e-greedy policy (e 5 0.05) for 520 k frames on SpaceInvaders. b, Average score achieved per episode for Seaquest. c, Averagepredicted action-value on a held-out set of states on Space Invaders. Each pointon the curve is the average of the action-value Q computed over the held-outset of states. Note that Q-values are scaled due to clipping of rewards (seeMethods). d, Average predicted action-value on Seaquest. See SupplementaryDiscussion for details.RESEARCH LETTER5 3 0 | N A T U R E | V O L 5 1 8 | 2 6 F E B R U A R Y 2 0 1 5Macmillan Publishers Limited. All rights reserved2015[Mnih+15] + tQ(x, a; )@Q(x, a; )@Q(x, a) = R(x, a) + maxb2AQ(x0, b; )Q(x, a; )ALQ(x, a) = Q(x, a) maxb2AQ(x, b; )Q(x, a; )PALQ(x, a) = maxALQ(x, a),Q(x, a) maxb2AQ(x0, b; )Q(x0, a; )19 60 DQN AL:27% PAL:35% Q ALPAL ::2::Atari2600::Advantage learningDQNPersistent A.L.Training Frames (Millions)Average ScoreASTERIXAdvantage learningDQNPersistent A.L.Average ScoreTraining Frames (Millions)SPACE INVADERSFigure 4: Learning curves for two Atari 2600 games in theOriginal DQN setting.AdvantagelearningDQNPersistent A.L.Episode StepsAction GapAdvantagelearningDQNPersistent A.L.Episode StepsEstimated State ValueFigure 5: Action gaps (left) and value functions (right) fora single episode of SPACE INVADERS (Original DQN set-ting). Our operators yield markedly increased action gapsand lower values.stems from the instability of the Q-functions learned fromBellman updates, and provided conclusive empirical evi-dence to this effect. In the spirit of their work, we comparedour learned Q-functions on a single trajectory generated by atrained DQN agent playing SPACE INVADERS in the Origi-nal DQN setting. For each Q-function and each state x alongthe trajectory, we computed V (x) as well as the action gapat x.The value functions and action gaps resulting from thisexperiment5 are depicted in Figure 5. As expected, the ac-tion gaps are significantly greater for both of our operators,in comparison to the action gaps produced by DQN. Fur-thermore, the value estimates are themselves lower, and cor-respond to more realistic estimates of the true value func-tion. In their experiments, van Hasselt et al. observed a sim-ilar effect on the value estimates when replacing the Bell-man updates with Double Q-Learning updates, one of manysolutions recently proposed to mitigate the negative impactof statistical bias in value function estimation (van Hasselt2010; Azar et al. 2011; Lee, Defourny, and Powell 2013).This bias is positive and is a consequence of the max term inthe Bellman operator. We hypothesize that the lower valueestimates observed in Figure 5 are also a consequence ofbias reduction. Specifically, increased action gaps are con-sistent with a bias reduction: it is easily shown that the valueestimation bias is strongest when Q-values are close to eachother. If our hypothesis holds true, the third benefit of in-creasing the action gap is thus to mitigate the statistical biasof Q-value estimates.5Videos: https://youtu.be/wDfUnMY3vF8Open QuestionsWeaker Conditions for Optimality. At the core of our re-sults lies the redefinition of Q-values in order to facilitateapproximate value estimation. Theorem 1 and our empiri-cal results indicate that there are many practical operatorswhich do not preserve suboptimal Q-values. Naturally, pre-serving the optimal value function V is itself unnecessary,as long as the iterates converge to a Q-function Q for whicharg maxaQ(x, a) = (x). It may well be that even weakerconditions for optimality exist than those required by Theo-rem 1. At the present, however, our proof technique does notappear to extend to this case.Statistical Efficiency of New Operators. Advantage learn-ing (as given by our redefinition) may be viewed as a gener-alization of the consistent Bellman operator when P ( | x, a)is unknown or irrelevant. In this light, we ask: is there aprobabilistic interpretation to advantage learning? We fur-ther wonder about the statistical efficiency of the consistentBellman operator: is it ever less efficient than the usual Bell-man operator, when considering the probability of misclassi-fying the optimal action? Both of these answers might shedsome light on the differences in performance observed in ourexperiments.Maximally Efficient Operator. Having revealed the exis-tence of a broad family of optimality-preserving operators,we may now wonder which of these operators, if any, shouldbe preferred to the Bellman operator. Clearly, there are triv-ial MDPs on which any optimality-preserving operator per-forms equally well. However, we may ask whether thereis, for a given MDP, a maximally efficient optimality-preserving operator; and whether a learning agent can ben-efit from simultaneously searching for this operator whileestimating a value function.Concluding RemarksWe presented in this paper a family of optimality-preservingoperators, of which the consistent Bellman operator is a dis-tinguished member. At the center of our pursuits lay the de-sire to increase the action gap; we showed through experi-ments that this gap plays a central role in the performance ofgreedy policies over approximate value functions, and howsignificantly increased performance could be obtained by asimple modification of the Bellman operator. We believe ourwork highlights the inadequacy of the classical Q-functionat producing reliable policies in practice, calls into questionthe traditional policy-value relationship in value-based rein-forcement learning, and illustrates how revisiting the con-cept of value itself can be fruitful.AcknowledgmentsThe authors thank Michael Bowling, Csaba Szepesvari,Craig Boutilier, Dale Schuurmans, Marty Zinkevich, LihongLi, Thomas Degris, and Joseph Modayil for useful discus-sions, as well as the anonymous reviewers for their excellentfeedback.Advantage learningDQNPersistent A.L.Training Frames (Millions)Average ScoreASTERIXAdvantage learningDQNPersistent A.L.Average ScoreTraining Frames (Millions)SPACE INVADERSFigure 4: Learning curves for two Atari 2600 games in theOriginal DQN setting.AdvantagelearningDQNPersistent A.L.Episode StepsAction GapAdvantagelearningDQNPersistent A.L.Episode StepsEstimated State ValueFigure 5: Action gaps (left) and value functions (right) fora single episode of SPACE INVADERS (Original DQN set-ting). Our operators yield markedly increased action gapsand lower values.stems from the instability of the Q-functions learned fromBellman updates, and provided conclusive empirical evi-dence to this effect. In the spirit of their work, we comparedour learned Q-functions on a single trajectory generated by atrained DQN agent playing SPACE INVADERS in the Origi-nal DQN setting. For each Q-function and each state x alongthe trajectory, we computed V (x) as well as the action gapat x.The value functions and action gaps resulting from thisexperiment5 are depicted in Figure 5. As expected, the ac-tion gaps are significantly greater for both of our operators,in comparison to the action gaps produced by DQN. Fur-thermore, the value estimates are themselves lower, and cor-respond to more realistic estimates of the true value func-tion. In their experiments, van Hasselt et al. observed a sim-ilar effect on the value estimates when replacing the Bell-man updates with Double Q-Learning updates, one of manysolutions recently proposed to mitigate the negative impactof statistical bias in value function estimation (van Hasselt2010; Azar et al. 2011; Lee, Defourny, and Powell 2013).This bias is positive and is a consequence of the max term inthe Bellman operator. We hypothesize that the lower valueestimates observed in Figure 5 are also a consequence ofbias reduction. Specifically, increased action gaps are con-sistent with a bias reduction: it is easily shown that the valueestimation bias is strongest when Q-values are close to eachother. If our hypothesis holds true, the third benefit of in-creasing the action gap is thus to mitigate the statistical biasof Q-value estimates.5Videos: https://youtu.be/wDfUnMY3vF8Open QuestionsWeaker Conditions for Optimality. At the core of our re-sults lies the redefinition of Q-values in order to facilitateapproximate value estimation. Theorem 1 and our empiri-cal results indicate that there are many practical operatorswhich do not preserve suboptimal Q-values. Naturally, pre-serving the optimal value function V is itself unnecessary,as long as the iterates converge to a Q-function Q for whicharg maxaQ(x, a) = (x). It may well be that even weakerconditions for optimality exist than those required by Theo-rem 1. At the present, however, our proof technique does notappear to extend to this case.Statistical Efficiency of New Operators. Advantage learn-ing (as given by our redefinition) may be viewed as a gener-alization of the consistent Bellman operator when P ( | x, a)is unknown or irrelevant. In this light, we ask: is there aprobabilistic interpretation to advantage learning? We fur-ther wonder about the statistical efficiency of the consistentBellman operator: is it ever less efficient than the usual Bell-man operator, when considering the probability of misclassi-fying the optimal action? Both of these answers might shedsome light on the differences in performance observed in ourexperiments.Maximally Efficient Operator. Having revealed the exis-tence of a broad family of optimality-preserving operators,we may now wonder which of these operators, if any, shouldbe preferred to the Bellman operator. Clearly, there are triv-ial MDPs on which any optimality-preserving operator per-forms equally well. However, we may ask whether thereis, for a given MDP, a maximally efficient optimality-preserving operator; and whether a learning agent can ben-efit from simultaneously searching for this operator whileestimating a value function.Concluding RemarksWe presented in this paper a family of optimality-preservingoperators, of which the consistent Bellman operator is a dis-tinguished member. At the center of our pursuits lay the de-sire to increase the action gap; we showed through experi-ments that this gap plays a central role in the performance ofgreedy policies over approximate value functions, and howsignificantly increased performance could be obtained by asimple modification of the Bellman operator. We believe ourwork highlights the inadequacy of the classical Q-functionat producing reliable policies in practice, calls into questionthe traditional policy-value relationship in value-based rein-forcement learning, and illustrates how revisiting the con-cept of value itself can be fruitful.AcknowledgmentsThe authors thank Michael Bowling, Csaba Szepesvari,Craig Boutilier, Dale Schuurmans, Marty Zinkevich, LihongLi, Thomas Degris, and Joseph Modayil for useful discus-sions, as well as the anonymous reviewers for their excellentfeedback.SpaceInvader[vanHasselt+10]20 ::2::DQNASTERIX BEAM RIDER PONGSEAQUEST SPACE INVADERSDQNPersistent A.L.A.L.Figure 7: Performance of trained agents in function of the parameter. Note that = 1.0 does not satisfy our theoremsconditions. We attribute the odd performance of Seaquest agents using Persistent Advantage Learning with = 0.9 to astatistical issue. 2 {0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0}=0 DQN21 [Mnih+15]4?? p1-p p=0.25 60 x10 DQN