Computational Mathematics and Mathematical Physics (Журнал вычислительной математики и математической физики)

  • ISSN (Print) 0044-4669
  • ISSN (Online) 3034-533

OPTIMAL APPROXIMATION OF AVERAGE REWARD MARKOV DECISION PROCESSES

PII
S0044466925030074-1
DOI
10.31857/S0044466925030074
Publication type
Article
Status
Published
Authors
Volume / Issue
Volume 65 / Issue 3
Pages
325-337
Abstract
We continue to develop the approach of finding an ε-optimal policy for Average Reward Markov Decision Processes (AMDPs) by reducing them to Discounted Markov Decision Processes (DMDPs). Existing reductions require the discount factor to exceed a certain threshold, which is typically close to one; as is well known, iterative methods for finding the optimal policy of a DMDP become less effective as the discount factor approaches one. Our work differs from existing studies in that it allows the empirical Bellman equation to be solved inexactly, while still matching the sample complexity of the latest results. In the upper bound we separate the contribution of the error in approximating the transition matrix from the contribution of the residual in solving the Bellman equation, so our results yield the total complexity of finding an ε-optimal policy for a DMDP with any method that has a theoretical iteration-complexity guarantee. Bibl. 17. Fig. 5. Table 1.
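For context, the reduction in question can be sketched in standard notation; the specific discount-factor choice below follows the cited literature and is an assumption, not necessarily the exact parametrization used in the paper. The average-reward criterion is replaced by the discounted Bellman optimality equation
\[
V^{*}_{\gamma}(s) \;=\; \max_{a \in \mathcal{A}} \Bigl[\, r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s,a)\, V^{*}_{\gamma}(s') \Bigr], \qquad \gamma \in (0,1),
\]
with the discount factor taken close to one, e.g. \(\gamma = 1 - \Theta(\varepsilon/H)\), where \(H\) is a complexity parameter of the AMDP (such as the span of the bias function or a mixing time), so that a policy that is approximately optimal for this DMDP is ε-optimal for the original average-reward problem.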
Keywords
Markov decision processes, reinforcement learning, sample complexity, iteration complexity, ICOMP2024
Date of publication
17.09.2025
Year of publication
2025

References

  1. Gheshlaghi Azar M., Munos R., Kappen H.J. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, vol. 91 (2013), pp. 325–349.
  2. Zurek M., Chen Y. Span-based optimal sample complexity for average reward MDPs. arXiv preprint arXiv:2311.13469.
  3. Tuynman A., Degenne R., Kaufmann E. Finding good policies in average-reward Markov decision processes without prior knowledge. arXiv preprint arXiv:2405.17108.
  4. Wang S., Blanchet J., Glynn P. Optimal sample complexity for average reward Markov decision processes. arXiv preprint arXiv:2310.08833.
  5. Bartlett P.L., Tewari A. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. arXiv preprint arXiv:1205.2661.
  6. Wang J., Wang M., Yang L.F. Near sample-optimal reduction-based policy learning for average reward MDP. arXiv preprint arXiv:2212.00603.
  7. Goyal V., Grand-Clement J. A first-order approach to accelerated value iteration. Operations Research, vol. 71, no. 2 (2023), pp. 517–535.
  8. Grand-Clement J. From convex optimization to MDPs: A review of first-order, second-order and quasi-Newton methods for MDPs. arXiv preprint arXiv:2104.10677.
  9. Farahmand A.M., Ghavamzadeh M. PID accelerated value iteration algorithm. In International Conference on Machine Learning. PMLR, pp. 3143–3153.
  10. Weissman T., Ordentlich E., Seroussi G., Verdu S., Weinberger M.J. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep. (2003), p. 125.
  11. Singh S.P., Yee R.C. An upper bound on the loss from approximate optimal-value functions. Machine Learning, vol. 16 (1994), pp. 227–233.
  12. Li G., Wei Y., Chi Y., Gu Y., Chen Y. Breaking the sample size barrier in model-based reinforcement learning with a generative model. Advances in Neural Information Processing Systems, vol. 33 (2020), pp. 12861–12872.
  13. Puterman M.L. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  14. Wang S., Blanchet J., Glynn P. Optimal sample complexity of reinforcement learning for mixing discounted Markov decision processes. arXiv preprint arXiv:2302.07477.
  15. Li T., Wu F., Lan G. Stochastic first-order methods for average-reward Markov decision processes. arXiv preprint arXiv:2205.05800.
  16. Jin Y., Sidford A. Towards tight bounds on the sample complexity of average-reward MDPs. In International Conference on Machine Learning. PMLR, pp. 5055–5064.
  17. Tiapkin D., Gasnikov A. Primal-dual stochastic mirror descent for MDPs. In International Conference on Artificial Intelligence and Statistics. PMLR, pp. 9723–9740.

Indexing
Scopus, Crossref, Higher Attestation Commission (at the Ministry of Education and Science of the Russian Federation), Scientific Electronic Library