Pure Mathematics 2024
An Eligibility Trace Method for Stochastic Linear Quadratic Control
Abstract:
This paper studies the application of reinforcement learning methods to the linear quadratic regulator (LQR) problem. In the study of LQR problems, the usual approach is to obtain the optimal control by solving the algebraic Riccati equation rather than optimizing the control gain directly. Building on the policy gradient algorithm, this paper introduces an eligibility trace method that optimizes the control gain matrix directly, and it considers the convergence of this method when the system parameters are known and when they are unknown. In the setting of a finite time horizon and Gaussian noise, global convergence guarantees are given for both the known-parameter and unknown-parameter cases. When the parameters are unknown, a zeroth-order optimization result is used to approximate the gradient term, which extends the analysis to the case where the cost function is non-convex. Numerical simulations show that the eligibility trace method converges faster and with smaller variance than the gradient descent algorithm.
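The abstract describes a policy-gradient scheme with eligibility traces for the finite-horizon, Gaussian-noise LQR, with a zeroth-order approximation of the gradient when the system matrices are unknown. The following is a minimal sketch of that pipeline, not the paper's actual algorithm: it assumes a static feedback gain u_t = -K x_t, a two-point zeroth-order gradient estimator, and an exponentially decayed accumulation of gradient estimates as the trace. The helper names (rollout_cost, zeroth_order_grad), the matrices A, B, Q, R, and the step size and decay values are illustrative placeholders.

```python
import numpy as np

# Minimal sketch: finite-horizon noisy LQR with a static feedback gain K,
# a two-point zeroth-order gradient estimate (model-free case), and an
# eligibility-trace-style accumulator on the gradient. All dimensions,
# matrices, step sizes, and the trace update are illustrative assumptions,
# not the paper's exact method.

rng = np.random.default_rng(0)
n, m, T = 3, 2, 20                      # state dim, input dim, horizon
A = rng.normal(size=(n, n)) * 0.3       # hypothetical system matrices
B = rng.normal(size=(n, m)) * 0.3
Q, R = np.eye(n), np.eye(m)             # stage costs
sigma_w = 0.1                           # Gaussian process-noise level


def rollout_cost(K, x0):
    """Simulate x_{t+1} = A x_t + B u_t + w_t with u_t = -K x_t and
    return the accumulated finite-horizon quadratic cost."""
    x, cost = x0.copy(), 0.0
    for _ in range(T):
        u = -K @ x
        cost += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u + sigma_w * rng.normal(size=n)
    return cost + x @ Q @ x              # terminal cost


def zeroth_order_grad(K, radius=0.05, samples=20):
    """Two-point zeroth-order estimate of grad_K E[cost]: average of
    (J(K + rU) - J(K - rU)) * U * d / (2r) over random unit directions U,
    where d = K.size."""
    d = K.size
    g = np.zeros_like(K)
    for _ in range(samples):
        U = rng.normal(size=K.shape)
        U /= np.linalg.norm(U)           # uniform direction on the sphere
        x0 = rng.normal(size=n)
        diff = rollout_cost(K + radius * U, x0) - rollout_cost(K - radius * U, x0)
        g += diff * U * d / (2.0 * radius)
    return g / samples


# Policy-gradient loop with an eligibility-trace-style accumulator z:
# the gradient estimates are accumulated with decay lam before the gain update.
K = np.zeros((m, n))                     # initial gain
z = np.zeros_like(K)                     # trace
eta, lam = 1e-4, 0.7                     # step size and trace decay (assumed)
for it in range(200):
    z = lam * z + zeroth_order_grad(K)
    K -= eta * z
    if it % 50 == 0:
        print(it, rollout_cost(K, rng.normal(size=n)))
```

In the known-parameter case the call to zeroth_order_grad would be replaced by the exact policy gradient computed from A, B, Q, R; the trace update itself would be unchanged.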