
Article

Keywords:
Markov decision processes; average cost criterion; approximate value iteration algorithm; contraction and non-expansive operators; perturbed Markov decision models
Summary:
The present paper studies the approximate value iteration (AVI) algorithm for the average cost criterion with bounded costs and Borel spaces. Convergence of the algorithm is shown and a performance bound is provided, assuming that the model satisfies a standard continuity-compactness assumption and a uniform ergodicity condition. This is done for the class of approximation procedures that can be represented by positive linear operators which represent constant functions exactly and also satisfy a certain continuity property. The main point is that these operators define transition probabilities on the state space of the controlled system. This has the following important consequences: (a) the approximating function is the average value of the target function with respect to the induced transition probability; (b) the approximation step in the AVI algorithm can be seen as a perturbation of the original Markov model; (c) the perturbed model inherits the ergodicity properties imposed on the original Markov model. These facts make it possible to bound the performance of the AVI algorithm in terms of the accuracy with which this class of operators approximates the primitive data of the model, namely, the one-step cost function and the system transition law. The bounds are given in terms of the supremum norm of bounded functions and the total variation norm of finite signed measures. The results are illustrated with numerical approximations for a class of single-item inventory systems with linear ordering cost, no set-up cost and no back-orders.
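The following is a minimal illustrative sketch (not code from the paper) of an approximate relative value iteration scheme of the kind described above, for a discretized single-item inventory model with linear ordering cost, no set-up cost and no back-orders. All parameter values, the grid, and the uniform demand law are assumptions made here for illustration. The approximation step uses piecewise-linear interpolation on a coarse grid, which is a positive linear operator that reproduces constant functions, matching the class of operators discussed in the summary.

# Minimal sketch of approximate relative value iteration (assumed parameters).
import numpy as np

# --- model data (illustrative assumptions, not from the paper) ---------------
CAPACITY = 20.0          # maximum inventory level (no back-orders allowed)
ORDER_COST = 2.0         # linear per-unit ordering cost (no set-up cost)
HOLDING_COST = 1.0       # per-unit holding cost
DEMANDS = np.arange(0, 9)                              # possible demand values
DEMAND_PROBS = np.full(len(DEMANDS), 1.0 / len(DEMANDS))  # uniform demand law

def stage_cost(x, a):
    """One-step cost: linear ordering cost plus holding cost."""
    return ORDER_COST * a + HOLDING_COST * x

def next_state(x, a, d):
    """No back-orders: unmet demand is lost, inventory stays in [0, CAPACITY]."""
    return min(max(x + a - d, 0.0), CAPACITY)

# --- approximation operator: piecewise-linear interpolation on a coarse grid --
GRID = np.linspace(0.0, CAPACITY, 11)   # representative states

def approx_operator(values_on_grid):
    """np.interp uses non-negative weights summing to one, so this is a
    positive linear operator that fixes constant functions."""
    return lambda x: np.interp(x, GRID, values_on_grid)

def bellman(h):
    """Average-cost Bellman operator applied to the function h, on the grid."""
    new_vals = np.empty_like(GRID)
    for i, x in enumerate(GRID):
        best = np.inf
        for a in np.arange(0.0, CAPACITY - x + 1e-9, 1.0):   # feasible orders
            exp_future = sum(p * h(next_state(x, a, d))
                             for d, p in zip(DEMANDS, DEMAND_PROBS))
            best = min(best, stage_cost(x, a) + exp_future)
        new_vals[i] = best
    return new_vals

# --- approximate relative value iteration ------------------------------------
h_vals = np.zeros_like(GRID)
for n in range(200):
    h = approx_operator(h_vals)    # approximation step (the "perturbed" model)
    t_vals = bellman(h)            # exact Bellman step evaluated on the grid
    gain = t_vals[0] - h_vals[0]   # relative VI estimate of the average cost
    h_vals = t_vals - t_vals[0]    # normalize at the reference state x = 0

print("approximate optimal average cost:", gain)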
References:
[1] Abounadi, J., Bertsekas, D., Borkar, V. S.: Learning algorithms for Markov decision processes with average cost. SIAM J. Control Optim. 40 (2001), 681-698. DOI 10.1137/s0363012999361974 | MR 1871450
[2] Aliprantis, C. D., Border, K. C.: Infinite Dimensional Analysis. Third edition. Springer-Verlag, Berlin 2006. MR 2378491
[3] Almudevar, A.: Approximate fixed point iteration with an application to infinite horizon Markov decision processes. SIAM J. Control Optim. 47 (2008), 2303-2347. DOI 10.1137/s0363012904441520 | MR 2448464
[4] Arapostathis, A., Borkar, V. S., Fernández-Gaucherand, E., Ghosh, M. K., Marcus, S. I.: Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J. Control Optim. 31 (1993), 282-344. DOI 10.1137/0331018 | MR 1205981
[5] Beutel, L., Gonska, H., Kacsó, D.: On variation-diminishing Schoenberg operators: new quantitative statements. In: Multivariate Approximation and Interpolations with Applications (M. Gasca, ed.), Monografías de la Academia de Ciencias de Zaragoza No. 20, 2002, pp. 9-58. MR 1966063
[6] Bertsekas, D. P.: Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs NJ 1987. MR 0896902 | Zbl 0649.93001
[7] Bertsekas, D. P., Tsitsiklis, J. N.: Neuro-Dynamic Programming. Athena Scientific, Belmont 1996. MR 3444832 | Zbl 0924.68163
[8] Bertsekas, D. P.: Approximate policy iteration: a survey and some new methods. J. Control Theory Appl. 9 (2011), 310-335. DOI 10.1007/s11768-011-1005-3 | MR 2833999
[9] Chang, H. S., Marcus, S. I.: Approximate receding horizon approach for Markov decision processes: average reward case. J. Math. Anal. Appl. 286 (2003), 636-651. DOI 10.1016/s0022-247x(03)00506-7 | MR 2008853
[10] Chang, H. S., Hu, J., Fu, M. C., Marcus, S. I.: Simulation-Based Algorithms for Markov Decision Processes. Second edition. Springer-Verlag, London 2013. DOI 10.1007/978-1-4471-5022-0 | MR 3052425
[11] Cooper, W. L., Henderson, S. G., Lewis, M. E.: Convergence of simulation-based policy iteration. Prob. Eng. Inform. Sci. 17 (2003), 213-234. DOI 10.1017/s0269964803172051 | MR 1961537
[12] DeVore, R. A.: The Approximation of Continuous Functions by Positive Linear Operators. Lecture Notes in Mathematics 293. Springer-Verlag, Berlin, Heidelberg 1972. DOI 10.1007/bfb0059493 | MR 0420083
[13] Dorea, C. C. Y., Pereira, A. G. C.: A note on a variation of Doeblin's condition for uniform ergodicity of Markov chains. Acta Math. Hungar. 110 (2006), 287-292. DOI 10.1007/s10474-006-0023-y | MR 2213230
[14] Dufour, F., Prieto-Rumeau, T.: Approximation of Markov decision processes with general state space. J. Math. Anal. Appl. 388 (2012), 1254-1267. DOI 10.1016/j.jmaa.2011.11.015 | MR 2869823
[15] Dufour, F., Prieto-Rumeau, T.: Stochastic approximations of constrained discounted Markov decision processes. J. Math. Anal. Appl. 413 (2014), 856-879. DOI 10.1016/j.jmaa.2013.12.016 | MR 3159809
[16] Dufour, F., Prieto-Rumeau, T.: Approximation of average cost Markov decision processes using empirical distributions and concentration inequalities. Stochastics 87 (2015), 273-307. DOI 10.1080/17442508.2014.939979 | MR 3316812
[17] Farias, D. P. de, Roy, B. van: On the existence of fixed points for approximate value iteration and temporal difference learning. J. Optim. Theory Appl. 105 (2000), 589-608. DOI 10.1023/a:1004641123405 | MR 1783879
[18] Farias, D. P. de, Roy, B. van: Approximate linear programming for average-cost dynamic programming. In: Advances in Neural Information Processing Systems 15 (S. Becker, S. Thrun and K. Obermayer, eds.), MIT Press, Cambridge MA 2002, pp. 1587-1594.
[19] Farias, D. P. de, Roy, B. van: A cost-shaping linear program for average-cost approximate dynamic programming with performance guarantees. Math. Oper. Res. 31 (2006), 597-620. DOI 10.1287/moor.1060.0208 | MR 2254426
[20] Gordon, G. J.: Stable function approximation in dynamic programming. In: Proc. Twelfth International Conference on Machine Learning (A. Prieditis and S. J. Russell, eds.), Tahoe City CA 1995, pp. 261-268. DOI 10.1016/b978-1-55860-377-6.50040-2
[21] Gosavi, A.: A reinforcement learning algorithm based on policy iteration for average reward: empirical results with yield management and convergence analysis. Machine Learning 55 (2004), 5-29. DOI 10.1023/b:mach.0000019802.64038.6c | MR 2549123
[22] Hernández-Lerma, O.: Adaptive Markov Control Processes. Springer-Verlag, NY 1989. DOI 10.1007/978-1-4419-8714-3 | MR 0995463 | Zbl 0677.93073
[23] Hernández-Lerma, O., Lasserre, J. B.: Discrete-Time Markov Control Processes. Basic Optimality Criteria. Springer-Verlag, NY 1996. DOI 10.1007/978-1-4612-0729-0 | MR 1363487 | Zbl 0840.93001
[24] Hernández-Lerma, O., Lasserre, J. B.: Further Topics on Discrete-Time Markov Control Processes. Springer-Verlag, NY 1999. DOI 10.1007/978-1-4612-0561-6 | MR 1697198 | Zbl 0928.93002
[25] Hernández-Lerma, O., Lasserre, J. B.: Markov Chains and Invariant Probabilities. Birkhauser Verlag, Basel 2003. DOI 10.1007/978-3-0348-8024-4 | MR 1974383 | Zbl 1036.60003
[26] Hernández-Lerma, O., Montes-de-Oca, R., Cavazos-Cadena, R.: Recurrence conditions for Markov decision processes with Borel spaces: a survey. Ann. Oper. Res. 29 (1991), 29-46. DOI 10.1007/bf02055573 | MR 1105165
[27] Hernández-Lerma, O., Vega-Amaya, O., Carrasco, G.: Sample-path optimality and variance-minimization of average cost Markov control processes. SIAM J. Control Optim. 38 (1999), 79-93. DOI 10.1137/s0363012998340673 | MR 1740606 | Zbl 0951.93074
[28] Jaskiewicz, A., Nowak, A. S.: On the optimality equation for average cost Markov control processes with Feller transition probabilities. J. Math. Anal. Appl. 316 (2006), 495-509. DOI 10.1016/j.jmaa.2005.04.065 | MR 2206685
[29] Klein, E., Thompson, A. C.: Theory of Correspondences. Wiley, New York 1984. MR 0752692
[30] Konda, V. R., Tsitsiklis, J. N.: Actor-critic algorithms. SIAM J. Control Optim. 42 (2003), 1143-1166. DOI 10.1137/s0363012901385691 | MR 2044789
[31] Lee, J. M., Lee, J. H.: Approximate dynamic programming strategies and their applicability for process control: a review and future directions. Int. J. Control Automat. Systems 2 (2004), 263-278.
[32] Meyn, S. P., Tweedie, R. L.: Markov Chains and Stochastic Stability. Springer-Verlag, London 1993. DOI 10.1007/978-1-4471-3267-7 | MR 1287609
[33] Montes-de-Oca, R., Lemus-Rodríguez, E.: An unbounded Berge's minimum theorem with applications to discounted Markov decision processes. Kybernetika 48 (2012), 268-286. MR 2954325 | Zbl 1275.90124
[34] Munos, R.: Performance bounds in $L_{p}$-norm for approximate value iteration. SIAM J. Control Optim. 46 (2007), 541-561. DOI 10.1137/040614384 | MR 2309039
[35] Nowak, A. S.: A generalization of Ueno's inequality for n-step transition probabilities. Appl. Math. 25 (1998), 295-299. DOI 10.4064/am-25-3-295-299 | MR 1637730
[36] Ortner, R.: Pseudometrics for state aggregation in average reward Markov decision processes. In: Algorithmic Learning Theory LNAI 4754 (M. Hutter, R. A. Servedio and E. Takimoto, eds.), Springer, Berlin, Heidelberg 2007, pp. 373-387. DOI 10.1007/978-3-540-75225-7_30
[37] Puterman, M. L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, NY 1994. DOI 10.1002/9780470316887 | MR 1270015 | Zbl 1184.90170
[38] Powell, W. B.: Approximate Dynamic Programming. Solving the Curse of Dimensionality. John Wiley and Sons Inc., 2007. DOI 10.1002/9780470182963 | MR 2347698
[39] Powell, W. B.: What you should know about approximate dynamic programming. Naval Res. Logist. 56 (2009), 239-249. DOI 10.1002/nav.20347 | MR 2503223
[40] Powell, W. B.: Perspectives of approximate dynamic programming. Ann. Oper. Res. 241 (2012), 319-356. DOI 10.1007/s10479-012-1077-6 | MR 3509421
[41] Powell, W. B., Ma, J.: A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications. J. Control Theory Appl. 9 (2011), 336-352. DOI 10.1007/s11768-011-0313-y | MR 2834000
[42] Robles-Alcaraz, M. T., Vega-Amaya, O., Minjárez-Sosa, J. A.: Estimate and approximate policy iteration algorithm for discounted Markov decision models with bounded costs and Borel spaces. Risk Decision Anal. 6 (2017), 79-95. DOI 10.3233/rda-160116
[43] Rust, J.: Numerical dynamic programming in economics. In: Handbook of Computational Economics, Vol. 13 (H. Amman, D. Kendrick and J. Rust, eds.), North-Holland, Amsterdam 1996, pp. 619-728. DOI 10.1016/s1574-0021(96)01016-7 | MR 1416619
[44] Santos, M. S.: Analysis of a numerical dynamic programming algorithm applied to economic models. Econometrica 66 (1998), 409-426. DOI 10.2307/2998564 | MR 1612175
[45] Santos, M. S., Rust, J.: Convergence properties of policy iteration. SIAM J. Control Optim. 42 (2004), 2094-2115. DOI 10.1137/s0363012902399824 | MR 2080931
[46] Saldi, N., Yuksel, S., Linder, T.: Asymptotic optimality of finite approximations to Markov decision processes with Borel spaces. Math. Oper. Res. 42 (2017), 945-978. DOI 10.1287/moor.2016.0832 | MR 3722422
[47] Stachurski, J.: Continuous state dynamic programming via nonexpansive approximation. Comput. Economics 31 (2008), 141-160. DOI 10.1007/s10614-007-9111-5
[48] Sutton, R. S., Barto, A. G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge MA 1998. DOI 10.1108/k.1998.27.9.1093.3
[49] Roy, B. van: Performance loss bounds for approximate value iteration with state aggregation. Math. Oper. Res. 31 (2006), 234-244. DOI 10.1287/moor.1060.0188 | MR 2233994
[50] Vega-Amaya, O.: The average cost optimality equation: a fixed-point approach. Bol. Soc. Mat. Mexicana 9 (2003), 185-195. MR 1988598
[51] Vega-Amaya, O.: Zero-sum average semi-Markov games: fixed point solutions of the Shapley equation. SIAM J. Control Optim. 42 (2003), 1876-1894. DOI 10.1137/s0363012902408423 | MR 2046390
[52] Vega-Amaya, O.: Solutions of the average cost optimality equation for Markov decision processes with weakly continuous kernel: The fixed-point approach revisited. J. Math. Anal. Appl. 464 (2018), 152-163. DOI 10.1016/j.jmaa.2018.03.077 | MR 3794081
[53] Vega-Amaya, O., López-Borbón, J.: A perturbation approach to a class of discounted approximate value iteration algorithms with Borel spaces. J. Dyn. Games 3 (2016), 261-278. DOI 10.3934/jdg.2016014 | MR 3562878
[54] Vega-Amaya, O., Montes-de-Oca, R.: Application of average dynamic programming to inventory systems. Math. Methods Oper. Res. 47 (1998), 451-471. DOI 10.1007/bf01198405 | MR 1637569