
Article

Title: A perturbation approach to approximate value iteration for average cost Markov decision processes with Borel spaces and bounded costs (English)
Author: Vega-Amaya, Óscar
Author: López-Borbón, Joaquín
Language: English
Journal: Kybernetika
ISSN: 0023-5954 (print)
ISSN: 1805-949X (online)
Volume: 55
Issue: 1
Year: 2019
Pages: 81-113
Summary lang: English
.
Category: math
.
Summary: The present paper studies the approximate value iteration (AVI) algorithm for the average cost criterion with bounded costs and Borel spaces. The convergence of the algorithm is shown and a performance bound is provided under the assumption that the model satisfies a standard continuity-compactness condition and a uniform ergodicity condition. This is done for the class of approximation procedures that can be represented by linear positive operators which represent constant functions exactly and also satisfy a certain continuity property. The main point is that these operators define transition probabilities on the state space of the controlled system. This has the following important consequences: (a) the approximating function is the average value of the target function with respect to the induced transition probability; (b) the approximation step in the AVI algorithm can be seen as a perturbation of the original Markov model; (c) the perturbed model inherits the ergodicity properties imposed on the original Markov model. These facts make it possible to bound the performance of the AVI algorithm in terms of the accuracy of the approximations given by this kind of operators for the primitive data of the model, namely, the one-step reward function and the system transition law. The bounds are given in terms of the supremum norm of bounded functions and the total variation norm of finite signed measures. The results are illustrated with numerical approximations for a class of single-item inventory systems with linear order cost, no set-up cost and no back-orders. (English)
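Illustration (not taken from the article): the sketch below shows, for a toy single-item inventory model of the kind used in the paper's numerical section, one form of AVI step of the sort described in the summary: an exact Bellman update for the average cost criterion followed by an approximation given by a positive linear operator that preserves constants (here piecewise-linear interpolation on a coarse grid, whose nonnegative weights sum to one and hence define a transition probability on the state space). All numerical parameters (capacity, demand law, cost coefficients) are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): approximate relative value iteration
# for a single-item inventory model with linear order cost, no set-up cost and
# no back-orders.  The approximation step uses piecewise-linear interpolation
# on a coarse grid; its weights are nonnegative and sum to one, so it is a
# positive linear operator that reproduces constants, i.e. it induces a
# transition probability on the state space.
# All model parameters below are illustrative assumptions.

import numpy as np

M = 40                                   # storage capacity (states 0, 1, ..., M)
c, h = 1.0, 0.2                          # unit order cost and unit holding cost
demand = np.array([0.3, 0.4, 0.2, 0.1])  # P(D = 0, 1, 2, 3)

coarse = np.arange(0, M + 1, 5)          # coarse grid used by the approximator

def approximate(v):
    """Positive linear operator: restrict v to the coarse grid and
    interpolate back; constant functions are reproduced exactly."""
    return np.interp(np.arange(M + 1), coarse, v[coarse])

def bellman(v):
    """One exact Bellman step for the average-cost inventory model."""
    new_v = np.empty(M + 1)
    for x in range(M + 1):
        best = np.inf
        for a in range(M - x + 1):       # order up to capacity, no set-up cost
            cost = c * a
            for d, p in enumerate(demand):
                nxt = max(x + a - d, 0)  # no back-orders
                cost += p * (h * nxt + v[nxt])
            best = min(best, cost)
        new_v[x] = best
    return new_v

v = np.zeros(M + 1)
for n in range(200):
    w = approximate(bellman(v))           # AVI step: Bellman update + approximation
    span = (w - v).max() - (w - v).min()  # span of differences: stopping diagnostic
    v = w - w[0]                          # relative VI: subtract value at reference state
    if span < 1e-6:
        break

gain_estimate = (bellman(v) - v)[0]       # approximate optimal average cost
print(f"iterations: {n + 1}, estimated average cost: {gain_estimate:.4f}")
```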
Keyword: Markov decision processes
Keyword: average cost criterion
Keyword: approximate value iteration algorithm
Keyword: contraction and non-expansive operators
Keyword: perturbed Markov decision models
MSC: 90C40
MSC: 90C59
MSC: 93E20
idZBL: Zbl 07088880
idMR: MR3935416
DOI: 10.14736/kyb-2019-1-0081
.
Date available: 2019-05-07T11:10:00Z
Last updated: 2020-02-27
Stable URL: http://hdl.handle.net/10338.dmlcz/147707
.
Reference: [1] Abounadi, J., Bertsekas, D., Borkar, V. S.: Learning algorithms for Markov decision processes with average cost..SIAM J. Control Optim. 40 (2001), 681-698. MR 1871450, 10.1137/s0363012999361974
Reference: [2] Aliprantis, C. D., Border, K. C.: Infinite Dimensional Analysis. Third edition..Springer-Verlag, Berlin 2006. MR 2378491
Reference: [3] Almudevar, A.: Approximate fixed point iteration with an application to infinite horizon Markov decision processes..SIAM J. Control Optim. 46 (2008), 541-561. MR 2448464, 10.1137/040614384
Reference: [4] Arapostathis, A., Borkar, V. S., Fernández-Gaucherand, E., Ghosh, M. K., Marcus, S. I.: Discrete-time controlled Markov processes with average cost criterion: a survey..SIAM J. Control Optim. 31 (1993), 282-344. MR 1205981, 10.1137/0331018
Reference: [5] Beutel, L., Gonska, H., Kacsó, D.: On variation-diminishing Schoenberg operators: new quantitative statements..In: Multivariate Approximation and Interpolations with Applications (M. Gasca, ed.), Monografías de la Academia de Ciencias de Zaragoza No. 20, 2002, pp. 9-58. MR 1966063
Reference: [6] Bertsekas, D. P.: Dynamic Programming: Deterministic and Stochastic Models..Prentice-Hall, Englewood Cliffs NJ 1987. Zbl 0649.93001, MR 0896902
Reference: [7] Bertsekas, D. P., Tsitsiklis, J. N.: Neuro-Dynamic Programming..Athena Scientific, Belmont 1996. Zbl 0924.68163, MR 3444832
Reference: [8] Bertsekas, D. P.: Approximate policy iteration: a survey and some new methods..J. Control Theory Appl. 9 (2011), 310-335. MR 2833999, 10.1007/s11768-011-1005-3
Reference: [9] Chang, H. S., Marcus, S. I.: Approximate receding horizon approach for Markov decision processes: average reward case..J. Math. Anal. Appl. 286 (2003), 636-651. MR 2008853, 10.1016/s0022-247x(03)00506-7
Reference: [10] Chang, H. S., Hu, J., Fu, M. C., Marcus, S. I.: Simulation-Based Algorithms for Markov Decision Processes. Second edition..Springer-Verlag, London 2013. MR 3052425, 10.1007/978-1-4471-5022-0
Reference: [11] Cooper, W. L., Henderson, S. G., Lewis, M. E.: Convergence of simulation-based policy iteration..Prob. Eng. Inform. Sci. 17 (2003), 213-234. MR 1961537, 10.1017/s0269964803172051
Reference: [12] DeVore, R. A.: The Approximation of Continuous Functions by Positive Linear Operators..Lectures Notes in Mathematics 293. Springer-Verlag, Berlin, Heidelberg 1972. MR 0420083, 10.1007/bfb0059493
Reference: [13] Dorea, C. C. Y., Pereira, A. G. C.: A note on a variation of Doeblin's condition for uniform ergodicity of Markov chains..Acta Math. Hungar. 110 (2006), 287-292. MR 2213230, 10.1007/s10474-006-0023-y
Reference: [14] Dufour, F., Prieto-Rumeau, T.: Approximation of Markov decision processes with general state space..J. Math. Anal. Appl. 388 (2012), 1254-1267. MR 2869823, 10.1016/j.jmaa.2011.11.015
Reference: [15] Dufour, F., Prieto-Rumeau, T.: Stochastic approximations of constrained discounted Markov decision processes..J. Math. Anal. Appl. 413 (2014), 856-879. MR 3159809, 10.1016/j.jmaa.2013.12.016
Reference: [16] Dufour, F., Prieto-Rumeau, T.: Approximation of average cost Markov decision processes using empirical distributions and concentration inequalities..Stochastics 87 (2015), 273-307. MR 3316812, 10.1080/17442508.2014.939979
Reference: [17] Farias, D. P. de, Roy, B. van: On the existence of fixed points for approximate value iteration and temporal difference learning..J. Optim. Theory Appl. 105 (2000), 589-608. MR 1783879, 10.1023/a:1004641123405
Reference: [18] Farias, D. P. de, Roy, B. van: Approximate linear programming for average-cost dynamic programming..In: Advances in Neural Information Processing Systems 15 (S. Becker, S. Thrun and K. Obermayer, eds.), MIT Press, Cambridge MA 2002, pp. 1587-1594.
Reference: [19] Farias, D. P. de, Roy, B. van: A cost-shaping linear program for average-cost approximate dynamic programming with performance guarantees..Math. Oper. Res. 31 (2006), 597-620. MR 2254426, 10.1287/moor.1060.0208
Reference: [20] Gordon, G. J.: Stable function approximation dynamic programming..In: Proc. Twelfth International Conference on Machine Learning (A. Prieditis and S. J. Russell, eds.), Tahoe City CA 1995, pp. 261-268. 10.1016/b978-1-55860-377-6.50040-2
Reference: [21] Gosavi, A.: A reinforcement learning algorithm based on policy iteration for average reward: empirical results with yield management and convergence analysis..Machine Learning 55 (2004), 5-29. MR 2549123, 10.1023/b:mach.0000019802.64038.6c
Reference: [22] Hernández-Lerma, O.: Adaptive Markov Control Processes..Springer-Verlag, NY 1989. Zbl 0677.93073, MR 0995463, 10.1007/978-1-4419-8714-3
Reference: [23] Hernández-Lerma, O., Lasserre, J. B.: Discrete-Time Markov Control Processes. Basic Optimality Criteria..Springer-Verlag, NY 1996. Zbl 0840.93001, MR 1363487, 10.1007/978-1-4612-0729-0
Reference: [24] Hernández-Lerma, O., Lasserre, J. B.: Further Topics on Discrete-Time Markov Control Processes..Springer-Verlag, NY 1999. Zbl 0928.93002, MR 1697198, 10.1007/978-1-4612-0561-6
Reference: [25] Hernández-Lerma, O., Lasserre, J. B.: Markov Chains and Invariant Probabilities..Birkhauser Verlag, Basel 2003. Zbl 1036.60003, MR 1974383, 10.1007/978-3-0348-8024-4
Reference: [26] Hernández-Lerma, O., Montes-de-Oca, R., Cavazos-Cadena, R.: Recurrence conditions for Markov decision processes with Borel spaces: a survey..Ann. Oper. Res. 29 (1991), 29-46. MR 1105165, 10.1007/bf02055573
Reference: [27] Hernández-Lerma, O., Vega-Amaya, O., Carrasco, G.: Sample-path optimality and variance-minimization of average cost Markov control processes..SIAM J. Control Optim. 38 (1999), 79-93. Zbl 0951.93074, MR 1740606, 10.1137/s0363012998340673
Reference: [28] Jaskiewicz, A., Nowak, A. S.: On the optimality equation for average cost Markov control processes with Feller transition probabilities..J. Math. Anal. Appl. 316 (2006), 495-509. MR 2206685, 10.1016/j.jmaa.2005.04.065
Reference: [29] Klein, E., Thompson, A. C.: Theory of Correspondences..Wiley, New York 1984. MR 0752692
Reference: [30] Konda, V. R., Tsitsiklis, J. N.: Actor-critic algorithms..SIAM J. Control Optim. 42 (2003), 1143-1166. MR 2044789, 10.1137/s0363012901385691
Reference: [31] Lee, J. M., Lee, J. H.: Approximate dynamic programming strategies and their applicability for process control: a review and future direction..Int. J. Control Automat. Systems 2 (2004), 263-278.
Reference: [32] Meyn, S. P., Tweedie, R. L.: Markov Chains and Stochastic Stability..Springer-Verlag, London 1993. MR 1287609, 10.1007/978-1-4471-3267-7
Reference: [33] Montes-de-Oca, R., Lemus-Rodríguez, E.: An unbounded Berge's minimum theorem with applications to discounted Markov decision processes..Kybernetika 48 (2012), 268-286. Zbl 1275.90124, MR 2954325
Reference: [34] Munos, R.: Performance bounds in $L_{p}$-norm for approximate value iteration..SIAM J. Control Optim. 47 (2007), 2303-2347. MR 2309039, 10.1137/s0363012904441520
Reference: [35] Nowak, A. S.: A generalization of Ueno's inequality for n-step transition probabilities..Appl. Math. 25 (1998), 295-299. MR 1637730, 10.4064/am-25-3-295-299
Reference: [36] Ortner, R.: Pseudometrics for state aggregation in average reward Markov decision processes..In: Algorithmic Learning Theory LNAI 4754 (M. Hutter, R. A. Serveido and E. Takimoto, eds.), Springer, Berlin, Heidelberg 2007, pp. 373-387. 10.1007/978-3-540-75225-7_30
Reference: [37] Puterman, M. L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming..Wiley, NY 1994. Zbl 1184.90170, MR 1270015, 10.1002/9780470316887
Reference: [38] Powell, W. B.: Approximate Dynamic Programming. Solving the Curse of Dimensionality..John Wiley and Sons Inc., 2007. MR 2347698, 10.1002/9780470182963
Reference: [39] Powell, W. B.: What you should know about approximate dynamic programming..Naval Res. Logist. 56 (2009), 239-249. MR 2503223, 10.1002/nav.20347
Reference: [40] Powell, W. B.: Perspectives of approximate dynamic programming..Ann. Oper. Res. 241 (2012), 319-356. MR 3509421, 10.1007/s10479-012-1077-6
Reference: [41] Powell, W. B., Ma, J.: A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications..J. Control Theory Appl. 9 (2011), 336-352. MR 2834000, 10.1007/s11768-011-0313-y
Reference: [42] Robles-Alcaraz, M. T., Vega-Amaya, O., Minjárez-Sosa, J. A.: Estimate and approximate policy iteration algorithm for discounted Markov decision models with bounded costs and Borel spaces..Risk Decision Anal. 6 (2017), 79-95. 10.3233/rda-160116
Reference: [43] Rust, J.: Numerical dynamic programming in economics..In: Handbook of Computational Economics, Vol. 13 (H. Amman, D. Kendrick and J. Rust, eds.), North-Holland, Amsterdam 1996, pp. 619-728. MR 1416619, 10.1016/s1574-0021(96)01016-7
Reference: [44] Santos, M. S.: Analysis of a numerical dynamic programming algorithm applied to economic models..Econometrica 66 (1998), 409-426. MR 1612175, 10.2307/2998564
Reference: [45] Santos, M. S., Rust, J.: Convergence properties of policy iteration..SIAM J. Control Optim. 42 (2004), 2094-2115. MR 2080931, 10.1137/s0363012902399824
Reference: [46] Saldi, N., Yuksel, S., Linder, T.: Asymptotic optimality of finite approximations to Markov decision processes with Borel spaces..Math. Oper. Res. 42 (2017), 945-978. MR 3722422, 10.1287/moor.2016.0832
Reference: [47] Stachurski, J.: Continuous state dynamic programming via nonexpansive approximation..Comput. Economics 31 (2008), 141-160. 10.1007/s10614-007-9111-5
Reference: [48] Sutton, R. S., Barto, A. G.: Reinforcement Learning: An Introduction..MIT Press, Cambridge MA 1998. 10.1108/k.1998.27.9.1093.3
Reference: [49] Roy, B. van: Performance loss bounds for approximate value iteration with state aggregation..Math. Oper. Res. 31 (2006), 234-244. MR 2233994, 10.1287/moor.1060.0188
Reference: [50] Vega-Amaya, O.: The average cost optimality equation: a fixed-point approach..Bol. Soc. Mat. Mexicana 9 (2003), 185-195. MR 1988598
Reference: [51] Vega-Amaya, O.: Zero-sum average semi-Markov games: fixed point solutions of the Shapley equation..SIAM J. Control Optim. 42 (2003), 1876-1894. MR 2046390, 10.1137/s0363012902408423
Reference: [52] Vega-Amaya, O.: Solutions of the average cost optimality equation for Markov decision processes with weakly continuous kernel: The fixed-point approach revisited..J. Math. Anal. Appl. 464 (2018), 152-163. MR 3794081, 10.1016/j.jmaa.2018.03.077
Reference: [53] Vega-Amaya, O., López-Borbón, J.: A Perturbation approach to a class of discounted approximate value iteration algorithms with Borel spaces..J. Dyn. Games 3 (2016), 261-278. MR 3562878, 10.3934/jdg.2016014
Reference: [54] Vega-Amaya, O., Montes-de-Oca, R.: Application of average dynamic programming to inventory systems..Math. Methods Oper. Res. 47 (1998), 451-471. MR 1637569, 10.1007/bf01198405
.
